Replication files for "Imputation in U.S. Manufacturing Data and Its Implications for Productivity Dispersion" by T. Kirk White, Jerome P. Reiter and Amil Petrin, for publication in The Review of Economics and Statistics.

The microdata from the U.S. Census Bureau's Census of Manufactures is confidential and cannot be released to the public. However, qualified researchers can submit proposals to gain access to the confidential microdata via the Federal Statistical Research Data Centers administered by the Census Bureau's Center for Economic Studies (CES). The process for submitting a research proposal is described on the CES website: https://www.census.gov/ces/rdcresearch/.

The SAS programs were run using SAS version 9.2, executing on the Linux 2.6.18-406.el5 (LIN X64) platform.
 
The programs can be run in the following order to produce the results in the article: 

0.  ASMimplibs.sas.  This file is NOT included in the replication files.  It contains the paths to the SAS library names 
    used in the SAS programs.  Census Bureau Center for Economic Studies policy does not allow path names on 
    Census Bureau servers to be released publicly.  However, this README file describes the flow of data from raw Census data to final datasets.  Access to ASMimplibs.sas will be provided upon request to researchers with approved access to the confidential Census of Manufactures microdata via the Federal Statistical Research Data Centers.

1. select_vars_all_industries_from_cmf.sas reads in variables from the raw 2002 and 2007 Census of Manufactures (CMF) files (which are stored as SAS datasets on the computers of the Federal Statistical Research Data Centers) and produces tables 1 and 2 in the paper.

2. Compute_imputation_rates_by_type_2002.sas reads in the raw 2002 CMF data, drops administrative-records plants and untabulated records, and produces plant counts of imputation by type for each variable used to compute TFP.  Output datasets from this program are used in Make_imputes_missing_all_cmf_2002_07.sas (see step 4 below).

3. Compute_imputation_rates_by_type_2007.sas. Basically the same as the 2002 program above, but using the 2007 CMF.  The program also computes the statistics reported in footnote 6 of the REStat article.  This program and the program in step 2 also compute the percentage of plants with imputed data for any variable used to compute TFPR, reported in the abstract and the introduction: 79% in 2002 and 73% in 2007.

4. Make_imputes_missing_all_cmf_2002_07.sas.  Using the output datasets from steps 2 and 3, this program deletes (makes missing) any value imputed using the industry average ratio method or the univariate regression method for each of the following variables: total employment (TE), total value of shipments (TVS), total inventories at beginning and end of year (TIB and TIE), total cost of materials (CM), cost of purchased electricity (EE), cost of fuels (CF), production worker hours (PH), and production worker wages (WW).
The output SAS datasets are gooddata_all_inds02 and gooddata_all_inds07.  The output datasets are used as inputs to the program in step 5, Create_NAICS_codes_dataset_for_CART.sas.
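
The masking rule in step 4 can be illustrated with a minimal Python sketch.  This is not the SAS program: the flag labels "industry_ratio" and "univariate_regression" are hypothetical stand-ins, because the actual CMF edit/impute flag codes are not listed in this README.

```python
# Hypothetical flag labels; the real CMF edit/impute flag codes are NOT given in this README.
MASKED_METHODS = {"industry_ratio", "univariate_regression"}

# Variables listed in step 4.
TFP_VARS = ["TE", "TVS", "TIB", "TIE", "CM", "EE", "CF", "PH", "WW"]


def make_imputes_missing(plant, flags):
    """Return a copy of `plant` with values imputed by a masked method set to None."""
    cleaned = dict(plant)
    for var in TFP_VARS:
        if flags.get(var) in MASKED_METHODS:
            cleaned[var] = None
    return cleaned


# Made-up plant record and flags, for illustration only.
plant = {"TE": 40, "TVS": 5200.0, "TIB": 310.0, "TIE": 295.0, "CM": 2100.0,
         "EE": 60.0, "CF": 25.0, "PH": 61.0, "WW": 900.0}
flags = {"TVS": "reported", "CM": "industry_ratio", "PH": "univariate_regression"}

cleaned = make_imputes_missing(plant, flags)
print(cleaned["CM"], cleaned["PH"], cleaned["TVS"])
```

Values with any other flag (or no flag) are left as they are, so reported data pass through unchanged.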

5. Create_NAICS_codes_dataset_for_CART.sas. This short program just creates 2 .csv files with all the 6-DIGIT NAICS codes that are used in the "gooddata_all_inds02" and "gooddata_all_inds07" datasets.  These will be used in the CART imputation programs that impute for missing data in every industry.

6. export_for_CART_imputation.sas. This program reads in the SAS datasets "gooddata_all_inds02" and "gooddata_all_inds07" from step 4, transforms some of the variables to enforce logical constraints on the CART imputation models (e.g., production worker wages cannot be greater than total salaries and wages for the same plant) and exports the data to .csv files. The .csv files are inputs to the R scripts in steps 7 and 8.
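
This README does not state which transformation the SAS program applies.  As an illustration only, one common way to enforce a constraint such as "production worker wages cannot exceed total salaries and wages" is to impute the logit of the share instead of the level, so that any imputed value maps back into the allowed range.  The Python sketch below assumes that approach; the function names and the argument name sw (total salaries and wages) are hypothetical.

```python
import math


def to_logit_share(ww, sw):
    """Logit of production worker wages (ww) as a share of total salaries and wages (sw).

    Assumes 0 < ww < sw; shares of exactly 0 or 1 would need separate handling.
    """
    share = ww / sw
    return math.log(share / (1.0 - share))


def from_logit_share(z, sw):
    """Map a (possibly imputed) logit share back to a wage level that lies in [0, sw]."""
    if z >= 0:
        share = 1.0 / (1.0 + math.exp(-z))
    else:
        # Numerically stable branch for negative z.
        e = math.exp(z)
        share = e / (1.0 + e)
    return share * sw


z = to_logit_share(900.0, 1500.0)
print(round(from_logit_share(z, 1500.0), 6))
```

Because the back-transformed share always lies between 0 and 1, an imputation made on the transformed scale cannot violate the constraint.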

The R scripts below were run using R version 3.1.3 (2015-03-09) -- "Smooth Sidewalk"
Platform: x86_64-redhat-linux-gnu (64-bit).
The scripts call the treeMI.R program, written by Lane Burgette and Jerry Reiter and described in Burgette and Reiter (2011).  
The treeMI.R program and two programs it calls, treeDraw.R, and bayesianboot.R, are also included in the replication 
folder, although they were not written for this paper.  Documentation for these programs is in treeMI.Rd, treeDraw.Rd and bayesianboot.Rd, also included in the replication folder.

7. CART_mi_script_all_mfg_2002_311.R, CART_mi_script_all_mfg_2002_312-315.R, CART_mi_script_all_mfg_2002_316_321.R, CART_mi_script_all_mfg_2002_322_323.R, CART_mi_script_all_mfg_2002_324_327.R, CART_mi_script_all_mfg_2002_331_332.R, CART_mi_script_all_mfg_2002_333_339.R.  These R scripts use the treeMI R package to create 100 CART-completed datasets. In each of the 100 datasets, each missing value in the "gooddata_all_inds02" dataset is replaced with a CART imputation.  The imputation models are run separately for each 6-digit NAICS industry,
and each script runs the model only for the 3-digit NAICS sectors named in its file name.  These scripts could be combined into one script; it would just take longer to run.
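
The treeMI.R, treeDraw.R and bayesianboot.R programs are the authoritative implementation and are not reproduced here.  As a rough illustration, and assuming (as the name bayesianboot.R suggests) that imputations are drawn from the observed "donor" values in a tree leaf using a Bayesian bootstrap, a single leaf draw could look like the Python sketch below.  The function name and the donor values are hypothetical.

```python
import random


def bayesian_bootstrap_draw(donors, n_draws, rng):
    """Draw n_draws values from `donors` using Bayesian bootstrap weights.

    The weights are the gaps between n - 1 sorted Uniform(0, 1) draws,
    which is equivalent to a flat Dirichlet draw over the donors.
    """
    n = len(donors)
    cuts = sorted(rng.random() for _ in range(n - 1))
    edges = [0.0] + cuts + [1.0]
    probs = [edges[i + 1] - edges[i] for i in range(n)]
    return rng.choices(donors, weights=probs, k=n_draws)


rng = random.Random(2002)               # fixed seed so the sketch is repeatable
leaf_donors = [12.0, 15.5, 9.8, 20.1]   # made-up observed values in one leaf
imputes = bayesian_bootstrap_draw(leaf_donors, n_draws=3, rng=rng)
print(imputes)
```

Every imputed value is an observed donor value, so imputations stay within the range of the reported data in that leaf.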

8. CART_mi_script_all_mfg_2007_all_inds.R creates 100 CART-completed datasets for the 2007 data in gooddata_all_inds07.

9. import_CART_imputes_2002_all_inds.sas, import_CART_imputes_2007_all_inds.sas.  These programs import the CART-completed .csv files that are the outputs of the R scripts in steps 7 and 8, merge on plant-level identifiers and other data, and save them as SAS datasets.

10. Calculate_IQRs_of_TFP.sas.  Calculates the within-industry-year TFPR IQRs for each manufacturing industry from (i) Census Bureau completed data and (ii) "non-imputed" data.    In addition to the output datasets from earlier steps, this program also requires two public-use datasets: 
(i) The NBER-CES productivity database, which is included in the replication folder as naics0205nberces.csv.  The latest version of the NBER-CES productivity database can be downloaded from www.nber.org/nberces. 
(ii) bea_naics.sas7bdat, which contains capital stock deflators constructed from publicly available BEA data.  This dataset is constructed by the program create_BEA_capital_file.sas using the files  BEAeastructure.csv, BEA_fixed_assets_csv.csv, gkheq.csv, 
gkhst.csv, nkceq.csv, and nkcst.csv.  These files are also described in README_BEA_files.txt.  All of these files are included in the replication folder for the REStat article.   

11. Calculate_IQRs_of_TFP_CART_MI.sas. This program calculates the within-industry-year TFPR IQRs for each manufacturing industry for each of the 100 implicates of CART-completed data.  Then for each industry-year it computes the mean TFPR IQR across the 100 implicates and saves the results as a SAS dataset.
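
The calculation in step 11 can be sketched in Python as follows.  This is an illustration, not the SAS program: the input layout is hypothetical, and the quartile definition (the "inclusive" method of Python's statistics.quantiles) is an assumption that may differ from the percentile definition used in SAS.

```python
from collections import defaultdict
from statistics import mean, quantiles


def iqr(values):
    """Interquartile range (75th minus 25th percentile) of a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def mean_iqr_by_industry(implicates):
    """Mean within-industry IQR across implicates.

    `implicates` is a list of datasets for one year; each dataset is a list
    of (industry, tfpr) pairs.
    """
    iqrs = defaultdict(list)
    for dataset in implicates:
        by_industry = defaultdict(list)
        for industry, tfpr in dataset:
            by_industry[industry].append(tfpr)
        for industry, values in by_industry.items():
            iqrs[industry].append(iqr(values))
    return {industry: mean(vals) for industry, vals in iqrs.items()}


# Two made-up implicates for one made-up industry code.
implicate_1 = [("327320", x) for x in (0.0, 1.0, 2.0, 3.0, 4.0)]
implicate_2 = [("327320", x) for x in (0.0, 2.0, 4.0, 6.0, 8.0)]
result = mean_iqr_by_industry([implicate_1, implicate_2])
print(result)
```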

12. Calculate_IQR_diffs_no_DA.sas.  This program uses the output datasets from steps 10 and 11 to calculate the differences in within-industry interquartile ranges (IQR) of TFPR for each industry-year for (i) the IQR computed from the "non-imputed" data minus the IQR computed from the Census Bureau completed data (i.e., the "cleaned" data) and (ii) the mean of the IQRs computed from 100 implicates of the CART-completed data minus the IQR computed from the Census Bureau completed data.   The program then produces the summary statistics that are presented in table 3 in the REStat article.  The program also produces statistics reported in the abstract and the introduction of the article: that for 90% and 84% of the industries in 2002 and 2007, respectively, the within-industry IQR of TFPR increases as we move from Census Bureau completed to "non-imputed" to CART-completed datasets.  The program also produces the following statistics mentioned in the introduction of the REStat article: in the CART-completed data, 66% (in 2002) and 51% (2007) of industries have TFPR IQRs that are at least 10 log points higher than they are for the same industry in the Census Bureau completed data.
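
A minimal Python sketch of the summary statistics in step 12 is given below.  It assumes that "increases" means a strict increase at each move and that "10 log points" corresponds to a difference of 0.10 in log TFPR; the industry labels and IQR values are made up.

```python
def summarize_iqr_differences(iqr_cb, iqr_nonimp, iqr_cart, threshold=0.10):
    """Compare within-industry TFPR IQRs across the three versions of the data.

    Each argument maps industry -> IQR for one year: Census Bureau completed
    (iqr_cb), "non-imputed" (iqr_nonimp), and the mean across CART-completed
    implicates (iqr_cart).
    """
    industries = sorted(iqr_cb)
    n = len(industries)
    monotone = [i for i in industries if iqr_cb[i] < iqr_nonimp[i] < iqr_cart[i]]
    large_gap = [i for i in industries if iqr_cart[i] - iqr_cb[i] >= threshold]
    return {
        "diff_nonimp_minus_cb": {i: iqr_nonimp[i] - iqr_cb[i] for i in industries},
        "diff_cart_minus_cb": {i: iqr_cart[i] - iqr_cb[i] for i in industries},
        "share_monotone": len(monotone) / n,
        "share_large_gap": len(large_gap) / n,
    }


# Made-up IQRs for four hypothetical industries.
cb = {"A": 0.20, "B": 0.30, "C": 0.25, "D": 0.40}
nonimp = {"A": 0.25, "B": 0.28, "C": 0.30, "D": 0.45}
cart = {"A": 0.35, "B": 0.33, "C": 0.32, "D": 0.60}
summary = summarize_iqr_differences(cb, nonimp, cart)
print(summary["share_monotone"], summary["share_large_gap"])
```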

13. START_3273_CONCRETE_2002_07.SAS. This program uses the 2002 and 2007 raw CMF product-level data 
       to select concrete plants. The output datasets are used as inputs to step 14. 

14. select_concrete_plants_from_cmf_asm.sas selects the 2002 and 2007 concrete samples from the original Census of 
       Manufactures microdata.  The output dataset for 2007 is used as an input to step 18.  The output dataset
       for 2002 is used as an input to step 15.

15. merge_captured_with_cmf02.sas reads in the 2002 concrete dataset from step 14 and merges it with the 2002 "captured" data.  The captured data are the data as they were reported on the survey forms, before any editing or imputation by 
the Census Bureau.  These data are not typically available to FSRDC researchers, but they can be requested
and made available to any researcher with approved access to the 2002 Census of Manufactures data.  The merged output file is used as an input to step 18.

16. import_2002_CBP_data.sas imports data on 2002 county-level construction employment from the 
        public-use County Business Patterns employment file 
        Cbp02co.txt (described in README_cbp), modifies it to impute employment for employment ranges
        (following Syverson 2004, JPE) and saves it as the SAS dataset constr_emp02, which is an input
        to the next step. 
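
The public-use County Business Patterns file suppresses some employment counts and reports an employment size range instead.  The Python sketch below shows one simple way to fill in such cells.  Two assumptions should be checked against README_cbp and Syverson (2004): the flag-to-range mapping shown here, and the use of the range midpoint, which is an illustrative choice and not necessarily the rule used in import_2002_CBP_data.sas.

```python
# Employment size ranges by flag (assumed; verify against the CBP record layout).
EMPFLAG_RANGES = {
    "A": (0, 19), "B": (20, 99), "C": (100, 249), "E": (250, 499),
    "F": (500, 999), "G": (1000, 2499), "H": (2500, 4999),
    "I": (5000, 9999), "J": (10000, 24999), "K": (25000, 49999),
    "L": (50000, 99999),
}


def impute_employment(emp, empflag):
    """Return reported employment, or the range midpoint when the cell is flagged.

    Returns None for flags with no closed range in the table above
    (for example an open-ended top category).
    """
    if not empflag:
        return float(emp)
    bounds = EMPFLAG_RANGES.get(empflag)
    if bounds is None:
        return None
    low, high = bounds
    return (low + high) / 2.0


print(impute_employment(137, ""), impute_employment(0, "B"))
```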

17. Construct_concrete_demand_density.sas creates the 2002 and 2007 concrete demand density datasets that are used
      as inputs to step 18.  The concrete demand density dataset is constructed from 4 public-use datasets:
      (i) city_county_07_areas.csv -- described in the file README_city_county.txt.
      (ii) BEAeastructure.csv -- described in README_BEA_files.txt.
      (iii) constr_emp02.sas7bdat -- created in the previous step.
      (iv) County_Construction_Employment_2007_raw.csv -- county-level construction employment from the public-use
           2007 County Business Patterns, described in README_cbp.  
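
This README does not give the demand density formula.  Assuming the Syverson (2004) definition, construction employment per square mile within a market area, the construction of the measure can be sketched in Python as below.  The market area codes, the tuple layout and all numbers are hypothetical.

```python
from collections import defaultdict


def demand_density(county_rows):
    """Construction employment per square mile, by market area.

    `county_rows` is a list of (market_area, construction_employment,
    land_area_sq_miles) tuples, one per county.
    """
    employment = defaultdict(float)
    area = defaultdict(float)
    for market_area, emp, sq_miles in county_rows:
        employment[market_area] += emp
        area[market_area] += sq_miles
    # Skip market areas with no recorded land area to avoid dividing by zero.
    return {m: employment[m] / area[m] for m in employment if area[m] > 0}


# Made-up counties in two hypothetical market areas.
rows = [("EA1", 1000.0, 400.0), ("EA1", 500.0, 600.0), ("EA2", 300.0, 150.0)]
density = demand_density(rows)
print(density)
```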

18. Create_concrete_datasets.sas reads in the 2002 and 2007 concrete samples and creates 2 datasets: (i) concrete_0207_gooddata, which contains the 2002 and 2007 concrete data, but with data imputed by the Census Bureau replaced with missing values--these data are merged with the 2002 and 2007 concrete demand density data and then used as inputs to step 19 as well as step 27; (ii) concrete_0207_w_CB_imputes--this is just the original 2002 and 2007 concrete data (including Census Bureau imputes and non-imputed data) merged with the concrete demand density data.  These data are used as inputs to step 27. 

19. export_concrete_samples_for_CART_imputation.sas.  This program exports the 2002 and 2007 non-imputed concrete data from SAS to comma-separated (.csv) files so that they can be read in by the R scripts that do the CART imputations.

The following two R scripts were executed using R version 2.15.2 (2012-10-26) -- "Trick or Treat"
Platform: x86_64-redhat-linux-gnu (64-bit)

20. multiply_impute_for_concrete02_script.R. The R script uses the 2002 non-imputed concrete data to create 500 CART-completed datasets and 500 CART-predicted datasets.  

21. multiply_impute_for_concrete07_script.R uses the 2007 non-imputed concrete data to create 500 CART-completed datasets and 500 CART-predicted datasets.  

22. import_concrete_cross_section_imputes.sas imports the 500 CART-completed concrete datasets for 2002 and 2007 from .csv files, merges on plant-level and firm-level identifiers and saves them as SAS datasets.  These datasets are used as inputs to steps 26 and 27. 

23. import_concrete_cross_section_predicted.sas imports the 500 CART-PREDICTED concrete datasets for 2002 and 2007 from .csv files, merges on plant-level and firm-level identifiers and saves them as SAS datasets.  These datasets are used as inputs to step 26. 

24. importmergeBLS_capital_327.sas.  This program creates the bls_capital_327 SAS dataset that is an input to 
       the program Construct_ind_cost_shares_concrete.sas in step 25.
       The file README_BLS_327.txt describes the public-use files that are inputs to the importmergeBLS_capital_327.sas program,
       all of which are included in the replication folder.


25. Construct_ind_cost_shares_concrete.sas.  Uses the NBER-CES productivity database, merged with the BLS capital data for concrete
        (bls_capital_327.sas7bdat), to calculate cost shares for the concrete industry in 2002 and 2007.  These cost shares are used
        in estimate_tfp_dispersion_by_year_no_DA.sas in step 26 and in estimate_prod_funcs_on_concrete.sas in step 27. 
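
The exact cost share and TFPR formulas are given in the article and in the SAS programs, not in this README.  As a hedged illustration, the Python sketch below assumes the index-number approach of Foster, Haltiwanger and Syverson (2008): each input's cost share is its cost divided by total cost, and log TFPR is log real revenue minus the share-weighted sum of log inputs.  The input names and all numbers are made up.

```python
import math


def cost_shares(costs):
    """Share of each input in total industry cost; `costs` maps input name -> cost."""
    total = sum(costs.values())
    return {name: cost / total for name, cost in costs.items()}


def log_tfpr(real_revenue, inputs, shares):
    """Log revenue productivity: ln(revenue) minus share-weighted ln(inputs)."""
    weighted_inputs = sum(shares[name] * math.log(inputs[name]) for name in shares)
    return math.log(real_revenue) - weighted_inputs


# Made-up industry costs: labor, capital (rental price times stock), materials, energy.
shares = cost_shares({"labor": 300.0, "capital": 200.0, "materials": 450.0, "energy": 50.0})

# Made-up plant: revenue e^2 and every input equal to e, so log TFPR = 2 - 1 = 1.
plant_inputs = {name: math.e for name in shares}
tfpr = log_tfpr(math.e ** 2, plant_inputs, shares)
print(shares, round(tfpr, 6))
```

Because the shares sum to one, scaling revenue and every input by the same factor leaves this measure unchanged.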

26. estimate_tfp_dispersion_by_year_no_DA.sas calculates within-industry IQRs of plant-level TFPR in the concrete industry in 2002 and 2007 using Census Bureau completed data and each of 500 CART-completed datasets and 500 CART-predicted datasets.  It calculates the mean of the 500 CART-based estimates.  These means and the IQRs from the Bureau-completed data are reported in the concrete rows of columns 1 and 2 of table 4 in the article.


27.  estimate_prod_funcs_on_concrete.sas estimates IQRs of TFPR for the concrete industry in 2002 and 2007 from: (i) non-imputed ("good") data; (ii) Census Bureau completed ("CB") data; and (iii) each of 500 CART-completed datasets.  The program also calculates the means of the 500 CART-completed concrete TFPR IQRs for 2002 and 2007.  The CART IQR estimates produced by this program are used as inputs to step 28.

28. Compute_PPD_checks_TFP_dispersion_concrete.sas computes TFPR dispersion statistics from 500 pairs of concrete datasets each for 2002 and 2007. In each pair, the first dataset is a CART-completed dataset (where each Census Bureau-imputed item is replaced with a CART imputation) and the second dataset is a CART-predicted dataset, where for any variable with any Bureau-imputed data, ALL of the observations are replaced with a CART imputation, using the same CART tree that was built for the first dataset in the pair. For each of the 500 pairs of datasets, the program then computes the TFPR IQR.  The program also reports the average difference between the CART-completed mean of the TFPR IQRs and the CART-predicted mean.  This difference and the difference divided by the CART-completed TFPR IQR are reported in the "concrete" row of columns 1 and 2 of table 5.
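
The two summary numbers described in step 28 can be sketched in Python as follows.  The sketch assumes that the denominator of the relative difference is the mean CART-completed IQR; the IQR values are made up, and only three pairs are shown instead of 500.

```python
from statistics import mean


def ppd_summary(pairs):
    """Summarize CART-completed versus CART-predicted TFPR IQRs.

    `pairs` is a list of (iqr_completed, iqr_predicted) tuples, one per pair
    of datasets. Returns the difference in means and that difference divided
    by the mean CART-completed IQR.
    """
    mean_completed = mean([completed for completed, _ in pairs])
    mean_predicted = mean([predicted for _, predicted in pairs])
    difference = mean_completed - mean_predicted
    return difference, difference / mean_completed


# Made-up IQRs for three pairs of datasets.
example_pairs = [(0.30, 0.28), (0.34, 0.30), (0.32, 0.32)]
difference, relative_difference = ppd_summary(example_pairs)
print(round(difference, 4), round(relative_difference, 4))
```

A small relative difference indicates that data generated entirely from the imputation model produce about the same dispersion as the CART-completed data.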
        

29. START_322211_BOXES_2002_07.sas, START_312113_ICE_2002_07.SAS. Programs that start with raw Census of Manufactures data and 
         create datasets that are used as inputs to program Create_real_values.sas in step  33.

30. importmergeBLS_capital_FHS_inds.sas.  Analogous to importmergeBLS_capital_327.sas, this program creates 
         the BLS_capital_FHS SAS dataset that is an input to the program in step 31.

31. Construct_ind_cost_shares_FHS_industries.sas  calculates industry cost shares for Foster, Haltiwanger and Syverson (2008)
        industries in 2002 and 2007 using the NBER-CES productivity database and the BLS_capital_FHS created in the previous step. 
        These cost shares are used in the program Create_real_values.sas in step 33.

32. Create_entry_exit_flags.sas.  This program creates the entry/exit/continuer flags that are used in the next step.

33. Create_real_values.sas: (i) selects samples from initial industry datasets for boxes and ice (as well as a few other industries which are not discussed in the final version of the paper); (ii) merges in deflators from the NBER-CES productivity database to create real values for flow variables; (iii) merges in BEA/BLS data to construct real capital stocks; (iv) merges in entry/exit/continuer binary variables; (v) merges in industry cost shares; (vi) identifies and creates indicators for imputed data; (vii) saves the data in datasets "phyboxesf" and "phyicef" (phy for physical quantity), to be used in step 39.

34. export_FHS_for_CART_imputation.sas.  Program that replaces imputed data in the 2002 and 2007 FHS industries with missing values 
         and exports the non-imputed datasets as .csv files for inputs to the CART imputation scripts.  

35. multiply_impute_for_boxes_script02.R, multiply_impute_for_ice_script.R:  R scripts that create 500 CART-completed and 
         CART-predicted datasets for the 2002 ice and boxes industries and the 2007 
         ice industry. These R scripts were run using R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows".

36. import_FHS_CART_imputes.sas. Program that imports the CART-predicted and CART-completed .csv files for ice and boxes 
         into SAS datasets which are inputs to Create_real_values_from_CART_imputes.sas and Create_real_values_from_CART_predicted.sas. 

37.  Create_real_values_from_CART_imputes.sas.  Analogous to Create_real_values.sas, but using 500 CART-completed datasets as 
         inputs.

38.   Create_real_values_from_CART_predicted.sas. The same as Create_real_values_from_CART_imputes.sas, but using CART-predicted
          datasets. 

39. FIX_OUTLIER_PPSR50_2002_07.sas.  Reads in physical quantity data for boxes and ice for 2002 and ice for 2007; creates TFPR, TFPQ, and price measures and trims outliers as in Foster, Haltiwanger and Syverson (2008) [FHS]; calculates and prints out the TFPR, TFPQ, and price dispersion measures in columns 1, 3, and 5 of table 4.  

40. PPD_checks_TFP_price_dispersion.sas computes within-industry IQRs of TFPR, TFPQ and price from 500 pairs of CART-completed and CART-predicted datasets for boxes (2002) and ice (2002 and 2007).  The means of the 500 estimates from CART-completed data are reported in columns 2, 4 and 6 of table 4 in the paper.  The mean differences between the CART-completed IQRs and the CART-predicted IQRs for boxes and ice and the difference divided by the CART-completed mean IQRs are presented in table 5. 


*****************************
Programs and datasets to produce tables 6 and A1 in the article (exit probits): 
replicating table 6 of Foster, Haltiwanger, and Syverson (2008) and doing robustness analysis on CART-completed data: 
*****************************


41. table6_aer.do.  This Stata do file produces the results in table A1 of White, Reiter and Petrin,
                which are just a replication of table 6 in FHS (2008, AER).
                This is FHS's do file. The only modification is that the file 
                paths were changed so that the program would run in the White, Reiter, and Petrin project space. 
                Lucia Foster, John Haltiwanger and Chad Syverson also gave us access to their estimation samples
                from the 1977-1997 CMF data.  These samples are also available upon request for researchers with 
                approved access to the CMF in the Federal Statistical Research Data Centers.

42. match_cmf_flags_to_fhs_pen.sas.  This program matches the edit/impute flags from the 1977-1997 Censuses of Manufactures
                (described in White (2014) "Recovering the Item-Level Edit and Imputation Flags in the 1977-1997 Censuses of Manufactures" https://ideas.repec.org/p/cen/wpaper/14-37.html), to FHS's penultimate datasets.
                FHS's penultimate datasets, which were made available to us by Lucia Foster, John Haltiwanger,
                and Chad Syverson, are their industry samples just prior to the final trimming of outliers and variable
                creation.  We used a modified version of their outlier trimming program in the next step. 
                The item-level edit/impute flags for the 1977-97 CMFs will be made available upon request
                to any researchers with approved access to the 1977-1997 Censuses of Manufactures in the 
                Federal Statistical Research Data Centers.

43. fix_outlier_ppsr50_make_PQS_CM_TVS_CF_EE_TIB_TIE_imps_missing.sas. This program reads in the FHS industry datasets for
                1977-1997 with item level edit/impute flags, replaces the Census Bureau's imputations with missing values,
                selects only the plants that are in FHS's final datasets, and saves the data.

44. export_FHS7797_for_CART_imputation.sas.  Reads in the output datasets from the previous step, 
                fix_outlier_ppsr50_make_PQS_CM_TVS_CF_EE_TIB_TIE_imps_missing.sas, and exports 
                the FHS 1977-1997 SAS datasets (where Census Bureau imputations
                have been replaced with missing values) to .csv files for CART imputation in the next step.
                Before exporting, the program transforms some of the variables to ensure that certain logical
                relationships are satisfied by the CART imputations (e.g., production worker wages cannot be 
                greater than total salaries and wages for the same plant).

45. R scripts for creating 500 CART-completed versions of FHS datasets for 1977-1997 (these R scripts were run using R version 3.1.3 (2015-03-09) -- "Smooth Sidewalk"):
multiply_impute_for_boxes_script.R, multiply_impute_for_bread_script.R, multiply_impute_for_carbon_script.R,
multiply_impute_for_coffee_script.R, multiply_impute_for_floor_script.R, multiply_impute_for_gas_script.R, 
multiply_impute_for_iceb_script.R, multiply_impute_for_icep_script.R, multiply_impute_for_plywood_script.R.  

46. The R scripts for the CART-completed 1977-1992 concrete datasets were run using R version 3.1.0 (2014-04-10) -- "Spring Dance":
multiply_impute_for_concrete77_script.R, multiply_impute_for_concrete82_script.R, multiply_impute_for_concrete87_script.R,
multiply_impute_for_concrete92_script.R
 
47. import_FHS_CART_imputes7797.sas. Imports the CART-completed datasets for boxes, bread, carbon black, coffee, flooring, gas,
                 process ice, block ice, and plywood from .csv files into SAS datasets, merges on plant and 
                 firm identifiers, and saves the SAS datasets as <industry name>_imputes.

48. import_FHS_concrete_CART_imputes.sas.  Imports the 1977-92 CART-completed datasets for concrete
                 from .csv files, merges on plant and firm identifiers, and saves the SAS datasets as conc_imputes.


49. Create_regression_vars_from_CART_imputes.sas.  This program reads in the CART-completed SAS datasets for the 1977-1997 FHS
                  industries and constructs the variables that will be used in the FHS exit probit regressions.
                  This program is a modified version of the program that Foster, Haltiwanger and Syverson used
                  to create their estimation variables and to select their estimation samples. 
                  The output SAS datasets are all<industry name>_postCART.  

50. create_dshk99.sas. Using the output datasets from Create_regression_vars_from_CART_imputes.sas, 
                   this program constructs demand shocks as described in FHS (2008).

51. merge_dshk99_with_postCARTdata.sas.  Merges the demand shocks from the CART-completed FHS data with the 
                rest of the CART-completed FHS data (the output datasets from Create_regression_vars_from_CART_imputes.sas).  
                The output datasets from this merge are inputs to the 
                export_FHS_w_dshk99_to_stata.sas program in the next step. 

52. export_FHS_w_dshk99_to_stata.sas.  Exports the CART-completed versions of the FHS datasets, including demand shocks,
                to a .csv file which is read in the next step by fillin_impute_for_nonimputed_industries.do.  

53. fillin_impute_for_nonimputed_industries.do.  For the sugar industry sample in the FHS data, there were very few 
                data items imputed by the Census Bureau, so we did not replace them with CART imputations.
                The next program, table6_aer_on_CART_imputes_w_dshk99.do, requires 500 sets of observations for each industry.
                So fillin_impute_for_nonimputed_industries.do just creates 500 copies of the sugar data
                and saves it with the rest of the CART-completed data as allwt_CART_full_d99.dta.

54. table6_aer_on_CART_imputes_w_dshk99.do. Using the allwt_CART_full_d99.dta dataset created in the previous step, 
                this program produces the results presented in table 6 of White, Reiter and Petrin.
                The sample of plant-years is the same as in table A1. The difference is that the data items that 
                were imputed by the Census Bureau in the A1 sample have been replaced by 500 imputations using the
                sequential CART method described in the paper.  The FHS exit probit regressions are run separately
                on each of the 500 CART-completed datasets.  The marginal effect estimates presented in table 6
                are the means of the 500 CART-completed estimates.  The standard errors are the standard errors from
                the 500 regressions combined using Rubin's combining formula.  
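
For readers unfamiliar with Rubin's combining rules, the calculation can be sketched in Python as below.  This is an illustration of the standard formula, not an excerpt from the Stata do file; the function name and the example numbers are made up.

```python
import math
from statistics import mean, variance


def rubin_combine(estimates, std_errors):
    """Combine m completed-data estimates using Rubin's rules.

    Returns the pooled point estimate and its standard error:
    total variance = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])   # average sampling variance
    between = variance(estimates)                   # variance across imputations (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)


# Made-up marginal effects and standard errors from three completed datasets.
pooled, pooled_se = rubin_combine([0.10, 0.12, 0.14], [0.02, 0.02, 0.02])
print(round(pooled, 4), round(pooled_se, 4))
```

The between-imputation term is what makes the combined standard error reflect uncertainty about the imputed values, which a single completed dataset would understate.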



                



  
